Integrated Classification Likelihood for Model Selection in Block Clustering
Abstract
Block clustering (also called co-clustering or simultaneous clustering) aims at simultaneously partitioning the rows and columns of a data table to reveal homogeneous block structures. This structure can stem from the latent block model, which provides a probabilistic model of data tables whose blocks arise from row and column clusters. For continuous data, each table entry is typically assumed to follow a Gaussian distribution whose parameters are common to all entries belonging to the same block, that is, entries with identical row and column classes. Several candidate models can be fitted to a given data table: they may differ in the numbers of clusters or, more generally, in the number of free parameters. Model selection then becomes a critical issue, for which the tools derived for model-based one-way clustering need to be adapted. We develop here a criterion based on an approximation of the integrated classification likelihood (ICL) of block models, and propose a BIC-like criterion derived from the form obtained. The proposed criteria are illustrated by experiments on simulated data, where their performance is shown to be most reliable for medium to large data tables with well-separated clusters.
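To make the Gaussian latent block model described above concrete, here is a minimal Python sketch that simulates a data table from it and evaluates the complete-data log-likelihood, that is, the likelihood of the table together with its row and column labels, which is the quantity that classification-likelihood criteria build on. The function names, the parameter layout, and the single standard deviation shared by all blocks are assumptions made for illustration; they are not taken from the paper, and the sketch does not implement the ICL or BIC-like criteria themselves.

```python
import numpy as np


def simulate_gaussian_lbm(n_rows, n_cols, row_props, col_props, means, sigma, seed=0):
    """Draw a table from a Gaussian latent block model (illustrative sketch).

    Assumption: every block shares the same standard deviation `sigma`;
    only the block means differ.
    """
    rng = np.random.default_rng(seed)
    row_props = np.asarray(row_props, dtype=float)
    col_props = np.asarray(col_props, dtype=float)
    means = np.asarray(means, dtype=float)
    # Latent row and column labels are drawn independently.
    z = rng.choice(len(row_props), size=n_rows, p=row_props)
    w = rng.choice(len(col_props), size=n_cols, p=col_props)
    # Entry (i, j) is Gaussian with the mean of block (z[i], w[j]).
    block_means = means[z][:, w]
    x = rng.normal(loc=block_means, scale=sigma)
    return x, z, w


def complete_data_loglik(x, z, w, row_props, col_props, means, sigma):
    """Log-likelihood of the table jointly with given row and column labels."""
    row_props = np.asarray(row_props, dtype=float)
    col_props = np.asarray(col_props, dtype=float)
    means = np.asarray(means, dtype=float)
    ll_rows = np.log(row_props[z]).sum()
    ll_cols = np.log(col_props[w]).sum()
    resid = x - means[z][:, w]
    ll_entries = (-0.5 * np.log(2 * np.pi * sigma**2)
                  - 0.5 * (resid / sigma) ** 2).sum()
    return ll_rows + ll_cols + ll_entries
```

With well-separated block means, the labels used to generate the table should score a much higher complete-data log-likelihood than a deliberately wrong labelling, which gives a quick sanity check of the sketch.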
Related articles
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
On Model-Based Clustering, Classification, and Discriminant Analysis
The use of mixture models for clustering and classification has burgeoned into an important subfield of multivariate analysis. These approaches have been around for a half-century or so, with significant activity in the area over the past decade. The primary focus of this paper is to review work in model-based clustering, classification, and discriminant analysis, with particular attenti...
Optimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines
In this paper, principles and existing feature selection methods for classifying and clustering data are introduced. To that end, categorizing frameworks for finding selected subsets, namely, search-based and non-search based procedures as well as evaluation criteria and data mining tasks are discussed. In the following, a platform is developed as an intermediate step toward developing an intell...
Estimation and Model Selection for Model-Based Clustering with the Conditional Classification Likelihood
The Integrated Completed Likelihood (ICL) criterion has been proposed by Biernacki et al. (2000) in the model-based clustering framework to select a relevant number of classes and has been used by statisticians in various application areas. A theoretical study of this criterion is proposed. A contrast related to the clustering objective is introduced: the conditional classification likelihood. ...